Using Association Rules of the Online Retail Dataset

1 Load Data

We first want to load our datasets and prepare them for some simple association rules mining.

tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")

tnx_data_tbl %>% glimpse()
## Rows: 1,021,424
## Columns: 23
## $ row_id            <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet       <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id        <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code        <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description       <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity          <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date      <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price             <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id       <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country           <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr    <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm      <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month     <chr> "December", "December", "December", "December", "Dec…
## $ invoice_dow       <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ invoice_dom       <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour      <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute    <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy       <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym        <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value       <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

To use our rules mining we just need the invoice data and the stock code, so we can ignore the rest. Also, we ignore the issue of returns and just look at purchases.

tnx_purchase_tbl <- tnx_data_tbl %>%
  filter(
    quantity > 0,
    price > 0,
    exclude == FALSE
    ) %>%
  select(
    invoice_id, stock_code, customer_id, quantity, price, stock_value,
    description
    )

tnx_purchase_tbl %>% glimpse()
## Rows: 992,023
## Columns: 7
## $ invoice_id  <chr> "489434", "489434", "489434", "489434", "489434", "489434"…
## $ stock_code  <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "2…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "130…
## $ quantity    <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3,…
## $ price       <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.75…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 3…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHTS…

We now write this data out as a CSV so arules can read it in and process it.

tnx_purchase_tbl %>% write_csv("data/tnx_purchase_tbl.csv")

We also want to load the free-text description of the various stock items as this will help will interpretation of the various rules we have found.

stock_code_lookups_tbl <- read_rds("data/stock_code_lookup_tbl.rds")

stock_code_lookups_tbl %>% glimpse()
## Rows: 4,725
## Columns: 2
## $ stock_code_upr <chr> "10002", "10002R", "10080", "10109", "10120", "10123C",…
## $ desc           <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", …

2 Basket Analysis with Association Rules

We now want to do some basic basket analysis using association rules, which tries to determine which items get bought together, similar to taking a graph approachin many ways.

basket_arules <- read.transactions(
    file   = "data/tnx_purchase_tbl.csv",
    format = "single",
    sep    = ",",
    header = TRUE,
    cols   = c("invoice_id", "stock_code")
    )

basket_arules %>% glimpse()
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  4895 obs. of  1 variable:
##   .. ..$ labels: chr [1:4895] "10002" "10002R" "10080" "10109" ...
##   ..@ itemsetInfo:'data.frame':  39510 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:39510] "489434" "489435" "489436" "489437" ...

Now that we have this data we can look at some basic plots much like we produced before. For example, we can look at the relative frequency of the different items.

itemFrequencyPlot(basket_arules, topN = 20)

itemFrequencyPlot(basket_arules, topN = 20, type = "absolute")

The stock codes do not mean a huge amount to us, so we also want to look at the description field for these items.

freq_codes <- itemFrequency(basket_arules) %>%
  sort(decreasing = TRUE) %>%
  head(20) %>%
  names()

tnx_purchase_tbl %>%
  select(stock_code, description) %>%
  filter(stock_code %in% freq_codes) %>%
  distinct() %>%
  drop_na(description) %>%
  group_by(stock_code) %>%
  summarise(
    .groups = "drop",
    desc = description %>% sort() %>% unique() %>% str_c(collapse = " : ")
    ) %>%
  arrange(stock_code) %>%
  datatable()

2.1 Basic Concepts

The basic ideas of association rule mining and basket analysis draws on basic ideas from probability theory.

We speak in terms of the itemset: that is, a collection of one or more items that co-occur in a transaction.

For example, suppose we have a list of transactions as follows:

ID Items
1 milk, bread
2 bread, butter
3 beer
4 milk, bread, butter
5 bread, butter

Using the above set of transactions, and itemset may be “milk” or “bread, butter”.

The support of an itemset \(X\), \(\text{Supp}(X)\), is defined as the proportion of transactions in the dataset which contain the itemset.

In the above example:

\[ \text{Supp}(\text{\{milk, bread\}}) = \frac{2}{5} = 0.40. \]

A rule, \(X \Rightarrow Y\), between two itemsets \(X\) and \(Y\) is a directed relationship of the itemset \(X\) showing the presence of \(Y\). The rule is not symmetric: \(X \Rightarrow Y\) and \(Y \Rightarrow X\) are not the same.

The confidence for the rule \(X \implies Y\), \(\text{Conf}(X \Rightarrow Y)\) is defined by

\[ \text{Conf}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)}. \]

So, to calculate the confidence for a rule:

\[ \text{Conf}(\text{\{milk, bread\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{0.4} = 0.5. \]

To illustrate how rules are not symmetric:

\[ \text{Conf}(\text{\{butter\}} \Rightarrow \text{\{milk, bread\}}) = \frac{0.2}{0.6} = 0.33. \]

Finally, we want a measure of the strength of the relationship between the itemsets \(X\) and \(Y\). That is, measuring the effect of the presence of \(X\) on the presence of \(Y\). We measure this by defining the lift of a rule as

\[ \text{Lift}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X) \text{Supp}(Y)}. \]

Again, we repeat our calculations for our rule.

\[ \text{Lift}(\text{\{bread, milk\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{(0.4)(0.6)} = \frac{0.2}{0.24} = 0.8333 \]

Lift values greater than 1 implies the presence of \(X\) increases the probability of \(Y\) being present when compared to the unconditional probability.

Now that we have these metrics and concepts, we can turn our attention to trying to find rules in a given dataset, using these metrics to rank them.

Rather than using brute-force approaches to discovering these rules, we use a number of different algorithms to find associations within the dataset.

The two main algorithms for discovering some rules are the apriori and the eclat algorithms.

2.2 Construct apriori Rules

We now want to construct the association rules using the apriori algorithm. To do this, we need to set parameters such as the minimum support and the minimum confidence level.

This gives us a set of association rules, along with the support and lift.

basket_apriori <- apriori(
    basket_arules,
    parameter = list(supp = 0.005, conf = 0.8)
    )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 197 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4895 item(s), 39510 transaction(s)] done [0.35s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.18s].
## writing ... [537 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
basket_apriori_tbl <- basket_apriori %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))

basket_apriori_tbl %>% glimpse()
## Rows: 537
## Columns: 6
## $ rules      <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "{2…
## $ support    <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.00528…
## $ confidence <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.93…
## $ coverage   <dbl> 0.005694761, 0.005416350, 0.005669451, 0.005416350, 0.00556…
## $ lift       <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.7131,…
## $ count      <int> 213, 205, 213, 201, 209, 209, 205, 204, 204, 203, 203, 232,…

We now want to inspect this table using the ruleExplorer()

basket_apriori %>% ruleExplorer()

To help visualise these rules, we can produce a basic scatterplot of the metrics.

ggplot(basket_apriori_tbl) +
  geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
  xlab("Rule Confidence") +
  ylab("Rule Lift") +
  ggtitle("Scatterplot of Association Rule Metrics")

2.3 Construct eclat Rules

An alternative method of constructing association rules is to use the eclat algorithm. The code for doing this is slightly different, but gives us similar outputs.

basket_eclat <- eclat(
    basket_arules,
    parameter = list(support = 0.005)
    ) %>%
  ruleInduction(
    basket_arules,
    confidence = 0.8
    )
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.005      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 197 
## 
## create itemset ... 
## set transactions ...[4895 item(s), 39510 transaction(s)] done [0.34s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating sparse bit matrix ... [1445 row(s), 39510 column(s)] done [0.01s].
## writing  ... [7742 set(s)] done [4.11s].
## Creating S4 object  ... done [0.00s].
basket_eclat_tbl <- basket_eclat %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))

basket_eclat_tbl %>% glimpse()
## Rows: 537
## Columns: 5
## $ rules      <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "{2…
## $ support    <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.00528…
## $ confidence <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.93…
## $ lift       <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.7131,…
## $ itemset    <int> 69, 41, 69, 39, 73, 73, 41, 26, 26, 38, 68, 76, 76, 40, 69,…

Once again, we inspect the data using ruleExplorer()

basket_eclat %>% ruleExplorer()

2.4 Compare Algorithms

We now want to compare the outputs of both algorithms in terms of association rules and how they compare.

basket_ap_tbl <- basket_apriori_tbl %>%
  select(rules, support, confidence, lift)

basket_ec_tbl <- basket_eclat_tbl %>%
  select(rules, support, confidence, lift)

rules_comparison_tbl <- basket_ap_tbl %>%
  full_join(basket_ec_tbl, by = "rules", suffix = c("_a", "_e"))

rules_comparison_tbl %>% glimpse()
## Rows: 537
## Columns: 7
## $ rules        <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "…
## $ support_a    <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.005…
## $ confidence_a <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.…
## $ lift_a       <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.713…
## $ support_e    <dbl> 0.005391040, 0.005188560, 0.005391040, 0.005087320, 0.005…
## $ confidence_e <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.…
## $ lift_e       <dbl> 150.2120, 150.1919, 149.0864, 149.0356, 148.9464, 148.713…

2.5 Reducing Minimum Confidence

While high confidence rules are useful, they are more likely to find rules that are “obvious” as the probabilities are such that co-occuring basket items will be noticed as being together - or possibly be natural complements: butter, milk and bread is a good example.

Instead, we are also interested in less obvious rules, and so we reduce our confidence threshold and see how many additional rules are discovered.

basket_lower_rules <- apriori(
    basket_arules,
    parameter = list(supp = 0.005, conf = 0.4)
  )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 197 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4895 item(s), 39510 transaction(s)] done [0.33s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.18s].
## writing ... [5950 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
basket_lower_rules_tbl <- basket_lower_rules %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))
ggplot(basket_lower_rules_tbl) +
  geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
  xlab("Rule Confidence") +
  ylab("Rule Lift") +
  ggtitle("Scatterplot of Association Rule Metrics")

3 Converting Rules to Graphs

We also have the ability to convert these rules to a graph representation, where each node is either a stock_code or a rule, with the edges of the graph representing that item being contained in the rule.

apriori_rules_igraph <- basket_apriori %>%
  plot(
    measure = "support",
    method  = "graph",
    control = list(max = 1000)
    ) %>%
  as("igraph")

apriori_rules_igraph %>% print()
## IGRAPH 8bbdd8a DN-- 679 2001 -- 
## + attr: name (v/c), label (v/c), support (v/n), confidence (v/n),
## | coverage (v/n), lift (v/n), count (v/n), order (v/n)
## + edges from 8bbdd8a (vertex names):
##  [1] 1732->assoc1  1255->assoc2  2120->assoc3  2117->assoc4  2120->assoc5 
##  [6] 2115->assoc6  2120->assoc7  2118->assoc8  2117->assoc9  2118->assoc10
## [11] 2119->assoc11 2118->assoc12 2115->assoc13 2118->assoc14 2116->assoc15
## [16] 2117->assoc16 2119->assoc17 2117->assoc18 2115->assoc19 2117->assoc20
## [21] 2116->assoc21 2119->assoc22 2115->assoc23 2119->assoc24 2116->assoc25
## [26] 2115->assoc26 2116->assoc27 2360->assoc28 2360->assoc29 1950->assoc30
## [31] 1950->assoc31 1950->assoc32 3748->assoc33 2359->assoc34 1949->assoc35
## + ... omitted several edges

We should first visualise this graph, using the top 30 rules in the dataset, as measured by the support of the rule.

basket_apriori %>%
  head(n = 30, by = "support") %>%
  plot(
    measure  = "lift",
    method   = "graph",
    engine   = "htmlwidget"
    )

3.1 Extract Connected Product Labels

First we want to look at the different disjoint components of the graph, and label them with an ID.

apriori_rules_tblgraph <- apriori_rules_igraph %>%
  igraph::as.undirected(mode = "collapse") %>%
  as_tbl_graph() %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

We then want to create groups of common products that form a disjoint cluster within this graph.

product_groups_all_tbl <- apriori_rules_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(component_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(
    product_group_id = component_id,
    product_count,
    stock_code = label
    ) %>%
  arrange(product_group_id, stock_code)

product_groups_all_tbl %>% glimpse()
## Rows: 142
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count    <int> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 6…
## $ stock_code       <chr> "20711", "20712", "20713", "20718", "20719", "20723",…

For display purposes, we can show all the stock_id values in a list.

3.1.1 Cluster Larger Groups

Within the large disjoint cluster there are a large number of products so rather than treating this as a single group we instead may investigate using further graph clustering algorithms to create further groupings.

apriori_rules_large_tblgraph <- apriori_rules_tblgraph %>%
  to_subgraph(component_size == max(component_size)) %>%
  use_series(subgraph) %>%
  morph(to_undirected) %>%
  mutate(
    sub_id = group_louvain()
    ) %>%
  unmorph()

Now that we have sub-divided this large subgraph, we repeat the process.

product_groups_largest_tbl <- apriori_rules_large_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(sub_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(
    product_group_id = sub_id, product_count, stock_code = label
    ) %>%
  arrange(product_group_id, stock_code)

product_groups_largest_tbl %>% glimpse()
## Rows: 67
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,…
## $ product_count    <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 16, 1…
## $ stock_code       <chr> "20719", "20723", "20724", "20725", "20727", "20728",…

Finally, it is worth trying to use an interactive tool to investigate this subgraph, we we can use visNetwork() to produce an interactive JS tool

apriori_rules_large_tblgraph %>%
  toVisNetworkData(idToLabel = FALSE) %>%
  visNetwork(
    nodes = .$nodes %>% transmute(id, label, group = sub_id),
    edges = .$edges
    )

3.2 Evaluating Product Groups

How do we go about assessing the validity of these product groups?

Note that this work is exploratory - in effect this is more sophisticated data exploration. Rather than use this model to make predictions - a job we will need to do at some point, we instead just want to assess how novel these grouping are.

To that end, it may be useful to check the co-occurrence of these products as a group - for each purchase made by a customer, what proportion of the group was featured in this data?

This question is worth exploring, so we should write some code to assess this.

Before we do this, we combine our two lists of product groups into a single table.

product_groups_tbl <- list(
    ALL = product_groups_all_tbl,
    LRG = product_groups_largest_tbl
    ) %>%
  bind_rows(.id = "type") %>%
  mutate(
    group_label = sprintf("%s_%02d", type, product_group_id)
    ) %>%
  group_by(group_label) %>%
  mutate(
    group_size  = n()
    ) %>%
  ungroup() %>%
  select(group_label, group_size, stock_code)

product_groups_tbl %>% glimpse()
## Rows: 209
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size  <int> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67…
## $ stock_code  <chr> "20711", "20712", "20713", "20718", "20719", "20723", "207…
tnx_groups_tbl <- tnx_data_tbl %>%
  select(invoice_id, invoice_date, stock_code) %>%
  group_nest(invoice_id, .key = "invoice_data")

group_props_tbl <- product_groups_tbl %>%
  group_nest(group_label, group_size, .key = "stock_data") %>%
  filter(group_size > 1, group_size < 15) %>%
  expand_grid(tnx_groups_tbl) %>%
  mutate(
    comb_data = future_map2(
      invoice_data, stock_data,
      inner_join,
      by = "stock_code",
    
      .options = furrr_options(globals = FALSE)
      ),
    match_count = map_int(comb_data, nrow),
    group_prop  = match_count / group_size
    ) %>%
  select(group_label, group_size, group_prop) %>%
  filter(group_prop > 0)

group_props_tbl %>% glimpse()
## Rows: 54,147
## Columns: 3
## $ group_label <chr> "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02"…
## $ group_size  <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ group_prop  <dbl> 0.2, 0.6, 0.3, 0.3, 0.1, 0.7, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2…

We now create a histogram of the proportions for each group, and this gives us a gauge of the ‘novelty’ of each of these groups.

plot_tbl <- group_props_tbl %>%
  mutate(label = glue("{group_label} ({group_size})"))

ggplot(plot_tbl) +
  geom_histogram(aes(x = group_prop), binwidth = 0.1) +
  facet_wrap(vars(label), scales = "free_y") +
  scale_y_continuous(labels = label_comma()) +
  xlab("Proportion") +
  ylab("Purchase Count") +
  ggtitle("Facetted Histograms of Group Coverages by Product Grouping") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

3.2.1 Investigate Groups

Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.

product_groups_tbl %>%
  filter(group_size > 1, group_size < 15) %>%
  mutate(stock_code_upr = stock_code %>% str_trim() %>% str_to_upper()) %>%
  left_join(stock_code_lookups_tbl, by = "stock_code_upr") %>%
  datatable()

4 Investigate Lower Support Rules

Our previous analysis was all based on rules with a minimum confidence of 0.80 so we now want to repeat our analysis but on this new set of rules.

apriori_lower_rules_igraph <- basket_lower_rules %>%
  plot(
    measure = "support",
    method  = "graph",
    control = list(max = 5000)
    ) %>%
  as("igraph")

apriori_lower_rules_igraph %>% glimpse()
## List of 10
##  $ :List of 1
##   ..$ 25: 'igraph.vs' Named int [1:4] 2315 2317 4812 4974
##   .. ..- attr(*, "names")= chr [1:4] "assoc1855" "assoc1857" "assoc4352" "assoc4514"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 27: 'igraph.vs' Named int [1:3] 2316 2317 4975
##   .. ..- attr(*, "names")= chr [1:3] "assoc1856" "assoc1857" "assoc4515"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 29: 'igraph.vs' Named int 1743
##   .. ..- attr(*, "names")= chr "assoc1283"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 217: 'igraph.vs' Named int [1:9] 1595 1596 2463 2464 2942 2943 3557 4127 4485
##   .. ..- attr(*, "names")= chr [1:9] "assoc1135" "assoc1136" "assoc2003" "assoc2004" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 218: 'igraph.vs' Named int [1:42] 561 562 563 1227 1228 1362 1363 1364 1592 1594 ...
##   .. ..- attr(*, "names")= chr [1:42] "assoc101" "assoc102" "assoc103" "assoc767" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 219: 'igraph.vs' Named int [1:23] 531 532 622 623 1228 1229 1298 1299 1596 1597 ...
##   .. ..- attr(*, "names")= chr [1:23] "assoc71" "assoc72" "assoc162" "assoc163" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 220: 'igraph.vs' Named int [1:42] 521 522 530 532 561 563 564 621 622 1545 ...
##   .. ..- attr(*, "names")= chr [1:42] "assoc61" "assoc62" "assoc70" "assoc72" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 222: 'igraph.vs' Named int [1:4] 2315 2316 4473 4811
##   .. ..- attr(*, "names")= chr [1:4] "assoc1855" "assoc1856" "assoc4013" "assoc4351"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 247: 'igraph.vs' Named int [1:91] 538 539 624 790 791 792 872 873 874 897 ...
##   .. ..- attr(*, "names")= chr [1:91] "assoc78" "assoc79" "assoc164" "assoc330" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  $ :List of 1
##   ..$ 248: 'igraph.vs' Named int [1:152] 539 540 552 554 555 574 575 576 703 759 ...
##   .. ..- attr(*, "names")= chr [1:152] "assoc79" "assoc80" "assoc92" "assoc94" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "74b754c9-a0c9-45b1-a997-1675725ad0c3"
##  - attr(*, "class")= chr "igraph"

Once again we have a quick look at the top 50 rules.

basket_lower_rules %>%
  head(n = 50, by = "support") %>%
  plot(
    measure  = "lift",
    method   = "graph",
    engine   = "htmlwidget"
    )

4.1 Determine Distinct Rules Subgraphs

Having converted the association rules to the graph, we then look at the distinct components of this graph and use these as our first pass at these clusters.

apriori_lower_rules_tblgraph <- apriori_lower_rules_igraph %>%
  igraph::as.undirected(mode = "collapse") %>%
  as_tbl_graph() %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

apriori_lower_rules_tblgraph %>% print()
## # A tbl_graph: 5460 nodes and 15770 edges
## #
## # An undirected simple graph with 88 components
## #
## # Node Data: 5,460 x 10 (active)
##   name  label support confidence coverage  lift count order component_id
##   <chr> <chr>   <dbl>      <dbl>    <dbl> <dbl> <int> <int>        <int>
## 1 25    1505…      NA         NA       NA    NA    NA    NA           14
## 2 27    1505…      NA         NA       NA    NA    NA    NA           14
## 3 29    1505…      NA         NA       NA    NA    NA    NA           14
## 4 217   20674      NA         NA       NA    NA    NA    NA            2
## 5 218   20675      NA         NA       NA    NA    NA    NA            2
## 6 219   20676      NA         NA       NA    NA    NA    NA            2
## # … with 5,454 more rows, and 1 more variable: component_size <int>
## #
## # Edge Data: 15,770 x 2
##    from    to
##   <int> <int>
## 1    11   461
## 2   217   461
## 3   218   461
## # … with 15,767 more rows
product_groups_lower_all_tbl <- apriori_lower_rules_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(component_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(product_group_id = component_id, product_count, stock_code = label) %>%
  arrange(product_group_id, stock_code)

product_groups_lower_all_tbl %>% glimpse()
## Rows: 460
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count    <int> 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216…
## $ stock_code       <chr> "20711", "20712", "20713", "20717", "20718", "20719",…
apriori_lower_rules_bigcomp_tblgraph <- apriori_lower_rules_tblgraph %>%
  to_subgraph(component_size == max(component_size)) %>%
  use_series(subgraph) %>%
  mutate(
    sub_id = group_louvain()
    )

apriori_lower_rules_bigcomp_tblgraph %>% print()
## # A tbl_graph: 4456 nodes and 13807 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 4,456 x 11 (active)
##   name  label support confidence coverage  lift count order component_id
##   <chr> <chr>   <dbl>      <dbl>    <dbl> <dbl> <int> <int>        <int>
## 1 247   20711      NA         NA       NA    NA    NA    NA            1
## 2 248   20712      NA         NA       NA    NA    NA    NA            1
## 3 249   20713      NA         NA       NA    NA    NA    NA            1
## 4 253   20717      NA         NA       NA    NA    NA    NA            1
## 5 254   20718      NA         NA       NA    NA    NA    NA            1
## 6 255   20719      NA         NA       NA    NA    NA    NA            1
## # … with 4,450 more rows, and 2 more variables: component_size <int>,
## #   sub_id <int>
## #
## # Edge Data: 13,807 x 2
##    from    to
##   <int> <int>
## 1     3   217
## 2    98   217
## 3    99   217
## # … with 13,804 more rows
product_groups_lower_bigcomp_tbl <- apriori_lower_rules_bigcomp_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(sub_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(product_group_id = sub_id, product_count, stock_code = label) %>%
  arrange(product_group_id, stock_code)

product_groups_lower_bigcomp_tbl %>% glimpse()
## Rows: 216
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count    <int> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 2…
## $ stock_code       <chr> "20711", "20712", "20713", "20717", "20718", "21033",…
product_groups_lower_tbl <- list(
    ALL = product_groups_lower_all_tbl,
    LRG = product_groups_lower_bigcomp_tbl
    ) %>%
  bind_rows(.id = "type") %>%
  mutate(
    group_label = sprintf("%s_%02d", type, product_group_id)
    ) %>%
  group_by(group_label) %>%
  mutate(
    group_size = n()
    ) %>%
  ungroup() %>%
  select(group_label, group_size, stock_code)

product_groups_lower_tbl %>% glimpse()
## Rows: 676
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size  <int> 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216…
## $ stock_code  <chr> "20711", "20712", "20713", "20717", "20718", "20719", "207…

Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.

product_groups_lower_tbl %>%
  filter(group_size > 1, group_size != max(group_size)) %>%
  mutate(stock_code_upr = stock_code %>% str_trim() %>% str_to_upper()) %>%
  left_join(stock_code_lookups_tbl, by = "stock_code_upr") %>%
  datatable()

5 Output Data to Disk

We now want to write the various data groups to disk.

As this may be useful for later analysis and for later modelling, we output these groupings for later use.

product_groups_tbl       %>% write_rds("data/product_groups_tbl.rds")
product_groups_lower_tbl %>% write_rds("data/product_groups_lower_tbl.rds")

6 R Environment

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.4 (2021-02-15)
##  os       Ubuntu 20.04.2 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2021-08-16                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  arules      * 1.6-7   2021-03-16 [1] RSPM (R 4.0.4)
##  arulesViz   * 1.4-0   2021-03-07 [1] RSPM (R 4.0.3)
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports     1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  bookdown      0.21    2020-10-13 [1] RSPM (R 4.0.2)
##  broom         0.7.5   2021-02-19 [1] RSPM (R 4.0.3)
##  bslib         0.2.4   2021-01-25 [1] RSPM (R 4.0.3)
##  cachem        1.0.4   2021-02-13 [1] RSPM (R 4.0.3)
##  cellranger    1.1.0   2016-07-27 [1] RSPM (R 4.0.3)
##  cli           2.3.1   2021-02-23 [1] RSPM (R 4.0.3)
##  codetools     0.2-18  2020-11-04 [2] CRAN (R 4.0.4)
##  colorspace    2.0-0   2020-11-11 [1] RSPM (R 4.0.3)
##  conflicted  * 1.0.4   2019-06-21 [1] RSPM (R 4.0.0)
##  cowplot     * 1.1.1   2020-12-30 [1] RSPM (R 4.0.3)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  crosstalk     1.1.1   2021-01-12 [1] RSPM (R 4.0.3)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  dbplyr        2.1.0   2021-02-03 [1] RSPM (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr       * 1.0.5   2021-03-05 [1] RSPM (R 4.0.3)
##  DT          * 0.17    2021-01-06 [1] RSPM (R 4.0.3)
##  ellipsis      0.3.1   2020-05-15 [1] RSPM (R 4.0.3)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  farver        2.1.0   2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap       1.1.0   2021-01-25 [1] RSPM (R 4.0.3)
##  forcats     * 0.5.1   2021-01-27 [1] RSPM (R 4.0.3)
##  foreach       1.5.1   2020-10-15 [1] RSPM (R 4.0.3)
##  fs            1.5.0   2020-07-31 [1] RSPM (R 4.0.3)
##  furrr       * 0.2.2   2021-01-29 [1] RSPM (R 4.0.3)
##  future      * 1.21.0  2020-12-10 [1] RSPM (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  ggplot2     * 3.3.3   2020-12-30 [1] RSPM (R 4.0.3)
##  globals       0.14.0  2020-11-22 [1] RSPM (R 4.0.3)
##  glue        * 1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  gtable        0.3.0   2019-03-25 [1] RSPM (R 4.0.3)
##  haven         2.3.1   2020-06-01 [1] RSPM (R 4.0.3)
##  highr         0.8     2019-03-20 [1] RSPM (R 4.0.3)
##  hms           1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools     0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  htmlwidgets   1.5.3   2020-12-10 [1] RSPM (R 4.0.3)
##  httr          1.4.2   2020-07-20 [1] RSPM (R 4.0.3)
##  igraph        1.2.6   2020-10-06 [1] RSPM (R 4.0.3)
##  iterators     1.0.13  2020-10-15 [1] RSPM (R 4.0.3)
##  jquerylib     0.1.3   2020-12-17 [1] RSPM (R 4.0.3)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr         1.31    2021-01-27 [1] RSPM (R 4.0.3)
##  labeling      0.4.2   2020-10-20 [1] RSPM (R 4.0.3)
##  lattice       0.20-41 2020-04-02 [2] CRAN (R 4.0.4)
##  lifecycle     1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  listenv       0.8.0   2019-12-05 [1] RSPM (R 4.0.3)
##  lubridate     1.7.10  2021-02-26 [1] RSPM (R 4.0.3)
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  Matrix      * 1.3-2   2021-01-06 [2] CRAN (R 4.0.4)
##  memoise       2.0.0   2021-01-26 [1] RSPM (R 4.0.3)
##  modelr        0.1.8   2020-05-19 [1] RSPM (R 4.0.3)
##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.0.3)
##  parallelly    1.24.0  2021-03-14 [1] RSPM (R 4.0.3)
##  pillar        1.5.1   2021-03-05 [1] RSPM (R 4.0.3)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  ps            1.6.0   2021-02-28 [1] RSPM (R 4.0.3)
##  purrr       * 0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  Rcpp          1.0.6   2021-01-15 [1] RSPM (R 4.0.3)
##  readr       * 1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  readxl        1.3.1   2019-03-13 [1] RSPM (R 4.0.3)
##  registry      0.5-1   2019-03-05 [1] RSPM (R 4.0.0)
##  reprex        1.0.0   2021-01-27 [1] RSPM (R 4.0.3)
##  rlang       * 0.4.10  2020-12-30 [1] RSPM (R 4.0.3)
##  rmarkdown     2.7     2021-02-19 [1] RSPM (R 4.0.3)
##  rmdformats    1.0.1   2021-01-13 [1] RSPM (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  rvest         1.0.0   2021-03-09 [1] RSPM (R 4.0.3)
##  sass          0.3.1   2021-01-24 [1] RSPM (R 4.0.3)
##  scales      * 1.1.1   2020-05-11 [1] RSPM (R 4.0.3)
##  seriation     1.2-9   2020-10-01 [1] RSPM (R 4.0.2)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.5.3   2020-09-09 [1] RSPM (R 4.0.3)
##  stringr     * 1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  tibble      * 3.1.0   2021-02-25 [1] RSPM (R 4.0.3)
##  tidygraph   * 1.2.0   2020-05-12 [1] RSPM (R 4.0.3)
##  tidyr       * 1.1.3   2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect    1.1.0   2020-05-11 [1] RSPM (R 4.0.3)
##  tidyverse   * 1.3.0   2019-11-21 [1] RSPM (R 4.0.3)
##  TSP           1.1-10  2020-04-17 [1] RSPM (R 4.0.0)
##  utf8          1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs         0.3.7   2021-03-29 [1] RSPM (R 4.0.4)
##  visNetwork    2.0.9   2019-12-06 [1] RSPM (R 4.0.3)
##  withr         2.4.1   2021-01-26 [1] RSPM (R 4.0.3)
##  xfun          0.22    2021-03-11 [1] RSPM (R 4.0.3)
##  xml2          1.3.2   2020-04-23 [1] RSPM (R 4.0.3)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library